{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "26e50a28",
   "metadata": {},
   "source": [
    "# Exploratory analysis\n",
    "\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/moj-analytical-services/splink/blob/master/docs/demos/tutorials/02_Exploratory_analysis.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>\n",
    "\n",
    "Exploratory analysis helps you understand features of your data which are relevant linking or deduplicating your data.\n",
    "\n",
    "Splink includes a variety of charts to help with this, which are demonstrated in this notebook.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96a3d08d",
   "metadata": {},
   "source": [
    "### Read in the data\n",
    "\n",
    "For the purpose of this tutorial we will use a 1,000 row synthetic dataset that contains duplicates.\n",
    "\n",
    "The first five rows of this dataset are printed below.\n",
    "\n",
    "Note that the cluster column represents the 'ground truth' - a column which tells us with which rows refer to the same person. In most real linkage scenarios, we wouldn't have this column (this is what Splink is trying to estimate.)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "09c3966a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-17T07:58:30.332066Z",
     "iopub.status.busy": "2024-07-17T07:58:30.331696Z",
     "iopub.status.idle": "2024-07-17T07:58:30.350746Z",
     "shell.execute_reply": "2024-07-17T07:58:30.349394Z"
    },
    "tags": [
     "hide_input"
    ]
   },
   "outputs": [],
   "source": [
    "# Uncomment and run this cell if you're running in Google Colab.\n",
    "# !pip install splink"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ffceed65",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-17T07:58:30.355541Z",
     "iopub.status.busy": "2024-07-17T07:58:30.355229Z",
     "iopub.status.idle": "2024-07-17T07:58:32.284766Z",
     "shell.execute_reply": "2024-07-17T07:58:32.283944Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unique_id</th>\n",
       "      <th>first_name</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>city</th>\n",
       "      <th>email</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alan</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>robert255@smith.net</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-05-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Rob</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>London</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>Lonon</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Grace</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1997-04-26</td>\n",
       "      <td>Hull</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   unique_id first_name surname         dob    city                    email\n",
       "0          0     Robert    Alan  1971-06-24     NaN      robert255@smith.net\n",
       "1          1     Robert   Allen  1971-05-24     NaN      roberta25@smith.net\n",
       "2          2        Rob   Allen  1971-06-24  London      roberta25@smith.net\n",
       "3          3     Robert    Alen  1971-06-24   Lonon                      NaN\n",
       "4          4      Grace     NaN  1997-04-26    Hull  grace.kelly52@jones.com"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from splink import  splink_datasets\n",
    "\n",
    "df = splink_datasets.fake_1000\n",
    "df = df.drop(columns=[\"cluster\"])\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d982939",
   "metadata": {},
   "source": [
    "## Analyse missingness\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfab11e8",
   "metadata": {},
   "source": [
    "It's important to understand the level of missingness in your data, because columns with higher levels of missingness are less useful for data linking.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6dae307c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-17T07:58:32.288838Z",
     "iopub.status.busy": "2024-07-17T07:58:32.288542Z",
     "iopub.status.idle": "2024-07-17T07:58:32.838324Z",
     "shell.execute_reply": "2024-07-17T07:58:32.837789Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-11a3a76fa67545d88798bd37d5d82306.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-11a3a76fa67545d88798bd37d5d82306.vega-embed details,\n",
       "  #altair-viz-11a3a76fa67545d88798bd37d5d82306.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-11a3a76fa67545d88798bd37d5d82306\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-11a3a76fa67545d88798bd37d5d82306\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-11a3a76fa67545d88798bd37d5d82306\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300}}, \"layer\": [{\"mark\": \"rect\", \"encoding\": {\"color\": {\"field\": \"completeness\", \"legend\": null, \"scale\": {\"scheme\": \"darkred\", \"zero\": true}, \"type\": \"quantitative\"}, \"tooltip\": [{\"field\": \"source_dataset\", \"title\": \"Source dataset\", \"type\": \"nominal\"}, {\"field\": \"total_rows_inc_nulls\", \"format\": \",\", \"title\": \"# of records\", \"type\": \"quantitative\"}, {\"field\": \"column_name\", \"title\": \"Column name\", \"type\": \"nominal\"}, {\"field\": \"total_null_rows\", \"format\": \",\", \"title\": \"# of nulls\", \"type\": \"quantitative\"}, {\"field\": \"completeness\", \"format\": \".1%\", \"type\": \"quantitative\"}], \"x\": {\"axis\": {\"labelAngle\": 20}, \"field\": \"column_name\", \"sort\": {\"field\": \"mean_comp\", \"order\": \"descending\"}, \"title\": \"Column name\", \"type\": \"nominal\"}, \"y\": {\"field\": \"source_dataset\", \"title\": \"Source dataset\", \"type\": \"nominal\"}}, \"title\": \"Column completeness by source dataset\", \"transform\": [{\"joinaggregate\": [{\"op\": \"mean\", \"field\": \"completeness\", \"as\": \"mean_comp\"}], \"groupby\": [\"column_name\"]}]}, {\"mark\": {\"type\": \"text\"}, \"encoding\": {\"color\": {\"condition\": {\"test\": \"datum['completeness'] < 0.5\", \"value\": \"white\"}, \"value\": \"black\"}, \"text\": {\"field\": \"completeness\", \"format\": \".0%\", \"type\": \"quantitative\"}, \"x\": {\"axis\": {\"labelAngle\": 0}, \"field\": \"column_name\", \"sort\": {\"field\": \"mean_comp\", \"order\": \"descending\"}, \"type\": \"nominal\"}, \"y\": {\"field\": \"source_dataset\", \"type\": \"nominal\"}}, \"transform\": [{\"joinaggregate\": [{\"op\": \"mean\", \"field\": \"completeness\", \"as\": \"mean_comp\"}], \"groupby\": [\"column_name\"]}]}], \"data\": {\"name\": \"data-3c49970cc0ec9c32e2509f8e79608056\"}, \"height\": {\"step\": 40}, \"width\": {\"step\": 40}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\", \"datasets\": {\"data-3c49970cc0ec9c32e2509f8e79608056\": [{\"source_dataset\": \"input_data_1\", \"column_name\": \"unique_id\", \"total_null_rows\": 0, \"total_rows_inc_nulls\": 1000, \"completeness\": 1.0}, {\"source_dataset\": \"input_data_1\", \"column_name\": \"first_name\", \"total_null_rows\": 169, \"total_rows_inc_nulls\": 1000, \"completeness\": 0.8309999704360962}, {\"source_dataset\": \"input_data_1\", \"column_name\": \"surname\", \"total_null_rows\": 181, \"total_rows_inc_nulls\": 1000, \"completeness\": 0.8190000057220459}, {\"source_dataset\": \"input_data_1\", \"column_name\": \"dob\", \"total_null_rows\": 0, \"total_rows_inc_nulls\": 1000, \"completeness\": 1.0}, {\"source_dataset\": \"input_data_1\", \"column_name\": \"city\", \"total_null_rows\": 187, \"total_rows_inc_nulls\": 1000, \"completeness\": 0.8130000233650208}, {\"source_dataset\": \"input_data_1\", \"column_name\": \"email\", \"total_null_rows\": 211, \"total_rows_inc_nulls\": 1000, \"completeness\": 0.7889999747276306}]}}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.LayerChart(...)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from splink.exploratory import completeness_chart\n",
    "from splink import DuckDBAPI\n",
    "db_api = DuckDBAPI()\n",
    "completeness_chart(df, db_api=db_api)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c35a4e0",
   "metadata": {},
   "source": [
    "The above summary chart shows that in this dataset, the `email`, `city`, `surname` and `forename` columns contain nulls, but the level of missingness is relatively low (less than 22%).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f11cc6b6",
   "metadata": {},
   "source": [
    "## Analyse the distribution of values in your data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "973cb505",
   "metadata": {},
   "source": [
    "The distribution of values in your data is important for two main reasons:\n",
    "\n",
    "1. Columns with higher cardinality (number of distinct values) are usually more useful for data linking. For instance, date of birth is a much stronger linkage variable than gender.\n",
    "\n",
    "2. The skew of values is important. If you have a `city` column that has 1,000 distinct values, but 75% of them are `London`, this is much less useful for linkage than if the 1,000 values were equally distributed\n",
    "\n",
    "The `profile_columns()` method creates summary charts to help you understand these aspects of your data.\n",
    "\n",
    "To profile all columns, leave the column_expressions argument empty.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "897d183c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-17T07:58:32.842112Z",
     "iopub.status.busy": "2024-07-17T07:58:32.841826Z",
     "iopub.status.idle": "2024-07-17T07:58:33.969583Z",
     "shell.execute_reply": "2024-07-17T07:58:33.968678Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-91bbabb769ff42319665d710e561acab.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-91bbabb769ff42319665d710e561acab.vega-embed details,\n",
       "  #altair-viz-91bbabb769ff42319665d710e561acab.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-91bbabb769ff42319665d710e561acab\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-91bbabb769ff42319665d710e561acab\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-91bbabb769ff42319665d710e561acab\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300}}, \"vconcat\": [{\"hconcat\": [{\"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"data\": {\"values\": [{\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.0, \"value_count\": 1, \"group_name\": \"_unique_id_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 1000.0, \"distinct_value_count\": 1000}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 1, \"group_name\": \"_unique_id_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 1000.0, \"distinct_value_count\": 1000}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Count of values\", \"type\": \"quantitative\"}}, \"title\": {\"text\": \"Distribution of counts of values in column \\\"unique_id\\\"\", \"subtitle\": \"In this col, 0 values (0.0%) are null and there are 1000 distinct values\"}}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"0\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"11\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"15\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"16\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"22\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"26\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"31\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"32\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"52\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"55\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Top 10 values by value count\"}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"0\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"11\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"15\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"16\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}, {\"value_count\": 1, \"group_name\": \"_unique_id_\", \"value\": \"22\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 1000}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"scale\": {\"domain\": [0, 1]}, \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"data\": {\"values\": [{\"percentile_ex_nulls\": 0.966305673122406, \"percentile_inc_nulls\": 0.972000002861023, \"value_count\": 28, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.9470517635345459, \"percentile_inc_nulls\": 0.9559999704360962, \"value_count\": 16, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 16.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.9290012121200562, \"percentile_inc_nulls\": 0.9409999847412109, \"value_count\": 15, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 15.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.9121540188789368, \"percentile_inc_nulls\": 0.9269999861717224, \"value_count\": 14, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 14.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.8965102434158325, \"percentile_inc_nulls\": 0.9139999747276306, \"value_count\": 13, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 13.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.8820698261260986, \"percentile_inc_nulls\": 0.9020000100135803, \"value_count\": 12, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.8423585891723633, \"percentile_inc_nulls\": 0.8690000176429749, \"value_count\": 11, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 33.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.818291187286377, \"percentile_inc_nulls\": 0.8489999771118164, \"value_count\": 10, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.7641395926475525, \"percentile_inc_nulls\": 0.8040000200271606, \"value_count\": 9, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 45.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.7352586984634399, \"percentile_inc_nulls\": 0.7799999713897705, \"value_count\": 8, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 24.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.701564371585846, \"percentile_inc_nulls\": 0.7519999742507935, \"value_count\": 7, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.6293622255325317, \"percentile_inc_nulls\": 0.6920000314712524, \"value_count\": 6, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 60.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.539109468460083, \"percentile_inc_nulls\": 0.6169999837875366, \"value_count\": 5, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 75.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.42358607053756714, \"percentile_inc_nulls\": 0.5210000276565552, \"value_count\": 4, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 96.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.33694344758987427, \"percentile_inc_nulls\": 0.4490000009536743, \"value_count\": 3, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 72.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.23826712369918823, \"percentile_inc_nulls\": 0.3669999837875366, \"value_count\": 2, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 82.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.1690000295639038, \"value_count\": 1, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 198.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 28, \"group_name\": \"_first_name_\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 335}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Count of values\", \"type\": \"quantitative\"}}, \"title\": {\"text\": \"Distribution of counts of values in column \\\"first_name\\\"\", \"subtitle\": \"In this col, 169 values (16.9%) are null and there are 335 distinct values\"}}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 28, \"group_name\": \"_first_name_\", \"value\": \"Oliver\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 16, \"group_name\": \"_first_name_\", \"value\": \"Jacob\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 15, \"group_name\": \"_first_name_\", \"value\": \"Freddie\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 14, \"group_name\": \"_first_name_\", \"value\": \"Olivia\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 13, \"group_name\": \"_first_name_\", \"value\": \"James\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 12, \"group_name\": \"_first_name_\", \"value\": \"George\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 11, \"group_name\": \"_first_name_\", \"value\": \"Elizabeth\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 11, \"group_name\": \"_first_name_\", \"value\": \"Jessica\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 11, \"group_name\": \"_first_name_\", \"value\": \"Alfie\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 10, \"group_name\": \"_first_name_\", \"value\": \"Logan\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Top 10 values by value count\"}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"_first_name_\", \"value\": \"Hall\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"_first_name_\", \"value\": \"Hayler\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"_first_name_\", \"value\": \"Edard\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"_first_name_\", \"value\": \"Harr\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"_first_name_\", \"value\": \"iElliot\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"scale\": {\"domain\": [0, 28]}, \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"data\": {\"values\": [{\"percentile_ex_nulls\": 0.66300368309021, \"percentile_inc_nulls\": 0.7239999771118164, \"value_count\": 6, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 48.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.6080585718154907, \"percentile_inc_nulls\": 0.6790000200271606, \"value_count\": 5, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 45.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.5054944753646851, \"percentile_inc_nulls\": 0.5950000286102295, \"value_count\": 4, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 84.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.39560437202453613, \"percentile_inc_nulls\": 0.5049999952316284, \"value_count\": 3, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 90.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.28815627098083496, \"percentile_inc_nulls\": 0.4169999957084656, \"value_count\": 2, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 88.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.1809999942779541, \"value_count\": 1, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 236.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.9719169735908508, \"percentile_inc_nulls\": 0.9769999980926514, \"value_count\": 23, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 23.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.9377289414405823, \"percentile_inc_nulls\": 0.9490000009536743, \"value_count\": 14, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.9059829115867615, \"percentile_inc_nulls\": 0.9229999780654907, \"value_count\": 13, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 26.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.891330897808075, \"percentile_inc_nulls\": 0.9110000133514404, \"value_count\": 12, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.8778998851776123, \"percentile_inc_nulls\": 0.8999999761581421, \"value_count\": 11, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 11.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.8534798622131348, \"percentile_inc_nulls\": 0.8799999952316284, \"value_count\": 10, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.831501841545105, \"percentile_inc_nulls\": 0.8619999885559082, \"value_count\": 9, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 18.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.7728937864303589, \"percentile_inc_nulls\": 0.8140000104904175, \"value_count\": 8, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 48.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.721611738204956, \"percentile_inc_nulls\": 0.7720000147819519, \"value_count\": 7, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 42.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 6, \"group_name\": \"_surname_\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 48.0, \"distinct_value_count\": 371}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Count of values\", \"type\": \"quantitative\"}}, \"title\": {\"text\": \"Distribution of counts of values in column \\\"surname\\\"\", \"subtitle\": \"In this col, 181 values (18.1%) are null and there are 371 distinct values\"}}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 23, \"group_name\": \"_surname_\", \"value\": \"Jones\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 14, \"group_name\": \"_surname_\", \"value\": \"Taylor\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 14, \"group_name\": \"_surname_\", \"value\": \"Davies\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 13, \"group_name\": \"_surname_\", \"value\": \"Hall\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 13, \"group_name\": \"_surname_\", \"value\": \"Campbell\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 12, \"group_name\": \"_surname_\", \"value\": \"Morgan\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 11, \"group_name\": \"_surname_\", \"value\": \"Smith\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 10, \"group_name\": \"_surname_\", \"value\": \"Russell\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 10, \"group_name\": \"_surname_\", \"value\": \"King\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 9, \"group_name\": \"_surname_\", \"value\": \"Dixon\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Top 10 values by value count\"}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"_surname_\", \"value\": \"Griffihs\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"_surname_\", \"value\": \"Atkinnos\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"_surname_\", \"value\": \"Kra\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"_surname_\", \"value\": \"BBird\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"_surname_\", \"value\": \"Hsmel\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"scale\": {\"domain\": [0, 23]}, \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"data\": {\"values\": [{\"percentile_ex_nulls\": 0.37699997425079346, \"percentile_inc_nulls\": 0.37699997425079346, \"value_count\": 2, \"group_name\": \"_dob_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 132.0, \"distinct_value_count\": 565}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.0, \"value_count\": 1, \"group_name\": \"_dob_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 377.0, \"distinct_value_count\": 565}, {\"percentile_ex_nulls\": 0.9099999666213989, \"percentile_inc_nulls\": 0.9099999666213989, \"value_count\": 6, \"group_name\": \"_dob_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 90.0, \"distinct_value_count\": 565}, {\"percentile_ex_nulls\": 0.8050000071525574, \"percentile_inc_nulls\": 0.8050000071525574, \"value_count\": 5, \"group_name\": \"_dob_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 105.0, \"distinct_value_count\": 565}, {\"percentile_ex_nulls\": 0.652999997138977, \"percentile_inc_nulls\": 0.652999997138977, \"value_count\": 4, \"group_name\": \"_dob_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 152.0, \"distinct_value_count\": 565}, {\"percentile_ex_nulls\": 0.5090000033378601, \"percentile_inc_nulls\": 0.5090000033378601, \"value_count\": 3, \"group_name\": \"_dob_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 144.0, \"distinct_value_count\": 565}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 2, \"group_name\": \"_dob_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 132.0, \"distinct_value_count\": 565}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Count of values\", \"type\": \"quantitative\"}}, \"title\": {\"text\": \"Distribution of counts of values in column \\\"dob\\\"\", \"subtitle\": \"In this col, 0 values (0.0%) are null and there are 565 distinct values\"}}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"2017-01-11\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"1990-03-08\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"2010-06-22\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"1972-06-12\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"2009-07-06\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"2011-05-26\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"2008-02-23\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"2000-04-03\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"2006-12-21\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 6, \"group_name\": \"_dob_\", \"value\": \"1995-06-21\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Top 10 values by value count\"}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"_dob_\", \"value\": \"1971-05-24\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 1, \"group_name\": \"_dob_\", \"value\": \"1977-10-17\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 1, \"group_name\": \"_dob_\", \"value\": \"1976-08-15\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 1, \"group_name\": \"_dob_\", \"value\": \"1996-08-05\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}, {\"value_count\": 1, \"group_name\": \"_dob_\", \"value\": \"1996-11-10\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 565}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"scale\": {\"domain\": [0, 6]}, \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"data\": {\"values\": [{\"percentile_ex_nulls\": 0.787207841873169, \"percentile_inc_nulls\": 0.8270000219345093, \"value_count\": 173, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 173.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.7380073666572571, \"percentile_inc_nulls\": 0.7870000004768372, \"value_count\": 40, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 40.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6974169611930847, \"percentile_inc_nulls\": 0.7540000081062317, \"value_count\": 33, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 33.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6715867519378662, \"percentile_inc_nulls\": 0.7330000400543213, \"value_count\": 21, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 21.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6494464874267578, \"percentile_inc_nulls\": 0.7150000333786011, \"value_count\": 18, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 18.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6076260805130005, \"percentile_inc_nulls\": 0.6809999942779541, \"value_count\": 17, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 34.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.5682656764984131, \"percentile_inc_nulls\": 0.6489999890327454, \"value_count\": 16, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 32.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.5166051387786865, \"percentile_inc_nulls\": 0.6069999933242798, \"value_count\": 14, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 42.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.48462486267089844, \"percentile_inc_nulls\": 0.5809999704360962, \"value_count\": 13, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 26.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.4403443932533264, \"percentile_inc_nulls\": 0.5449999570846558, \"value_count\": 12, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 36.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.41574418544769287, \"percentile_inc_nulls\": 0.5249999761581421, \"value_count\": 10, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.3936039209365845, \"percentile_inc_nulls\": 0.5069999694824219, \"value_count\": 9, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 18.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.3640836477279663, \"percentile_inc_nulls\": 0.4829999804496765, \"value_count\": 8, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 24.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.31242311000823975, \"percentile_inc_nulls\": 0.44099998474121094, \"value_count\": 7, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 42.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.2829028367996216, \"percentile_inc_nulls\": 0.4169999957084656, \"value_count\": 6, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 24.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.25830256938934326, \"percentile_inc_nulls\": 0.3970000147819519, \"value_count\": 5, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.24354243278503418, \"percentile_inc_nulls\": 0.38499999046325684, \"value_count\": 4, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.2287822961807251, \"percentile_inc_nulls\": 0.37300002574920654, \"value_count\": 3, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.19680196046829224, \"percentile_inc_nulls\": 0.34700000286102295, \"value_count\": 2, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 26.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.18699997663497925, \"value_count\": 1, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 160.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 173, \"group_name\": \"_city_\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 173.0, \"distinct_value_count\": 218}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Count of values\", \"type\": \"quantitative\"}}, \"title\": {\"text\": \"Distribution of counts of values in column \\\"city\\\"\", \"subtitle\": \"In this col, 187 values (18.7%) are null and there are 218 distinct values\"}}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 173, \"group_name\": \"_city_\", \"value\": \"London\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 40, \"group_name\": \"_city_\", \"value\": \"Birmingham\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 33, \"group_name\": \"_city_\", \"value\": \"Liverpool\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 21, \"group_name\": \"_city_\", \"value\": \"Coventry\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 18, \"group_name\": \"_city_\", \"value\": \"Newcastle-upon-Tyne\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 17, \"group_name\": \"_city_\", \"value\": \"Manchester\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 17, \"group_name\": \"_city_\", \"value\": \"Leeds\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 16, \"group_name\": \"_city_\", \"value\": \"Aberdeen\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 16, \"group_name\": \"_city_\", \"value\": \"Bristol\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 14, \"group_name\": \"_city_\", \"value\": \"Southend-on-Sea\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Top 10 values by value count\"}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"_city_\", \"value\": \"Mancheeser\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"_city_\", \"value\": \"Mancseter\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"_city_\", \"value\": \"Birmimghan\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"_city_\", \"value\": \"Norwhi\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"_city_\", \"value\": \"nLondon\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"scale\": {\"domain\": [0, 173]}, \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"data\": {\"values\": [{\"percentile_ex_nulls\": 0.9619771838188171, \"percentile_inc_nulls\": 0.9700000286102295, \"value_count\": 6, \"group_name\": \"_email_\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 30.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.8479087352752686, \"percentile_inc_nulls\": 0.8799999952316284, \"value_count\": 5, \"group_name\": \"_email_\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 90.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.6653992533683777, \"percentile_inc_nulls\": 0.7360000014305115, \"value_count\": 4, \"group_name\": \"_email_\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 144.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.47148287296295166, \"percentile_inc_nulls\": 0.5830000042915344, \"value_count\": 3, \"group_name\": \"_email_\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 153.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.32446134090423584, \"percentile_inc_nulls\": 0.46700000762939453, \"value_count\": 2, \"group_name\": \"_email_\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 116.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.21100002527236938, \"value_count\": 1, \"group_name\": \"_email_\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 256.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 6, \"group_name\": \"_email_\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 30.0, \"distinct_value_count\": 424}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Count of values\", \"type\": \"quantitative\"}}, \"title\": {\"text\": \"Distribution of counts of values in column \\\"email\\\"\", \"subtitle\": \"In this col, 211 values (21.1%) are null and there are 424 distinct values\"}}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 6, \"group_name\": \"_email_\", \"value\": \"omoore64@randall.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"_email_\", \"value\": \"j.williams@levine-johnson.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"_email_\", \"value\": \"jessica.miller@johnson.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"_email_\", \"value\": \"iwilkinson@bush.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"_email_\", \"value\": \"fb@nelson.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"_email_\", \"value\": \"t.m39@brooks-sawyer.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"_email_\", \"value\": \"riley.k@zavala.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"_email_\", \"value\": \"dh@powell.net\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"_email_\", \"value\": \"lily.robinson73@peterson.biz\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"_email_\", \"value\": \"l.armstrong@reyes-campbell.net\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Top 10 values by value count\"}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"_email_\", \"value\": \"l.c91@perez-gonzalez.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"_email_\", \"value\": \"o.griffiths90@reyes-coleman.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"_email_\", \"value\": \"r.b80@ellis-berry.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"_email_\", \"value\": \"js37@rodriguez-vega.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"_email_\", \"value\": \"lucas@marrington.coh\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"scale\": {\"domain\": [0, 6]}, \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Bottom 5 values by value count\"}]}], \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\"}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.VConcatChart(...)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from splink.exploratory import profile_columns\n",
    "\n",
    "profile_columns(df, db_api=DuckDBAPI(), top_n=10, bottom_n=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c5d3a2d",
   "metadata": {},
   "source": [
    "This chart is very information-dense, but here are some key takehomes relevant to our linkage:\n",
    "\n",
    "- There is strong skew in the `city` field with around 20% of the values being `London`. We therefore will probably want to use `term_frequency_adjustments` in our linkage model, so that it can weight a match on London differently to a match on, say, `Norwich`.\n",
    "\n",
    "- Looking at the \"Bottom 5 values by value count\", we can see typos in the data in most fields. This tells us this information was possibly entered by hand, or using Optical Character Recognition, giving us an insight into the type of data entry errors we may see.\n",
    "\n",
    "- Email is a much more uniquely-identifying field than any others, with a maximum value count of 6. It's likely to be a strong linking variable.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c537f065-f61f-402a-ad2c-a8575b6207ab",
   "metadata": {},
   "source": [
    "!!! note \"Further Reading\"\n",
    "\n",
    "    :material-tools: For more on exploratory analysis tools in Splink, please refer to the [Exploratory Analysis API documentation](../../api_docs/exploratory.md).\n",
    "\n",
    "    :bar_chart: For more on the charts used in this tutorial, please refer to the [Charts Gallery](../../charts/index.md#exploratory-analysis).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f37cb1e",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "At this point, we have begun to develop a strong understanding of our data. It's time to move on to estimating a linkage model\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
